Abstract

We investigate extensions of well-known online learning algorithms such as fixed-share of Herbster and Warmuth
(1998) or the methods proposed by Bousquet and Warmuth (2002). These algorithms use weight
sharing schemes to perform as well as the best sequence of experts with a limited number of
changes. Here we show, with a common, general, and simpler analysis, that weight sharing in
fact achieves much more than what it was designed for. We use it to simultaneously prove new
shifting regret bounds for online convex optimization on the simplex in terms of the total variation
distance as well as new bounds for the related setting of adaptive regret. Finally, we exhibit the first
logarithmic shifting bounds for exp-concave loss functions on the simplex.

Keywords: Prediction with expert advice, online convex optimization, tracking the best expert,
shifting experts.

1. Introduction

Online convex optimization is a sequential prediction paradigm in which, at each time step, the
learner chooses an element from a fixed convex set S and then is given access to a convex loss
function defined on the same set. The value of the function on the chosen element is the learner's
loss. Many problems such as prediction with expert advice, sequential investment, and online
regression/classification can be viewed as special cases of this general framework. Online learning
algorithms are designed to minimize the regret. The standard notion of regret is the difference
between the learner's cumulative loss and the cumulative loss of the single best element in S. A
much harder criterion to minimize is shifting regret, which is defined as the difference between the
learner's cumulative loss and the cumulative loss of an arbitrary sequence of elements in S. Shifting
regret bounds are typically expressed in terms of the shift, a notion of regularity measuring the
length of the trajectory in S described by the comparison sequence (i.e., the sequence of elements
against which the regret is evaluated).

In online convex optimization, shifting regret bounds for convex subsets $S \subseteq \mathbb{R}^d$ are obtained
for the online mirror descent (or follow-the-regularized-leader) algorithm. In this case the shift is
typically computed in terms of the p-norm of the difference of consecutive elements in the comparison
sequence; see Herbster and Warmuth (2001) and Cesa-Bianchi and Lugosi (2006). In this
paper we focus on the important special case when S is the simplex, and investigate the online 
mirror descent with entropic regularizers. This family includes popular algorithms such as
exponentially weighted average (EWA), Winnow, and exponentiated gradient. Proving general shifting
bounds in this case is difficult due to the behavior of the regularizer at the boundary of the simplex. 
Herbster and Warmuth (2001) show shifting bounds for mirror descent with entropic regularizers 
using a 1-norm to measure the shift. In order to keep mirror descent from choosing points too close 
to the simplex boundary, they use a complex dynamic projection technique. When the comparison 
sequence is restricted to the corners of the simplex (which is the setting of prediction with expert 
advice), then the shift is naturally defined to be the number of times the trajectory moves to a different
corner. This problem is often called "tracking the best expert" — see, e.g., Herbster and Warmuth 
(1998); Vovk (1999); Herbster and Warmuth (2001); Bousquet and Warmuth (2002); György et al.
(2005), and it is well known that EWA with weight sharing, which corresponds to the fixed- 
share algorithm of Herbster and Warmuth (1998), achieves a good shifting bound in this setting. 
Bousquet and Warmuth (2002) introduce a generalization of the fixed-share algorithm, and prove 
various shifting bounds for any trajectory in the simplex. However, their bounds are expressed 
using a quantity that corresponds to a proper shift only for trajectories on the simplex corners. 

Our analysis unifies, generalizes (and simplifies) the previously quite different proof techniques 
and algorithms used in Herbster and Warmuth (1998) and Bousquet and Warmuth (2002). Our 
bounds are expressed in terms of a notion of shift based on the total variation distance. The gener- 
alization of the "small expert set" result in Bousquet and Warmuth (2002) leads us to obtain better 
bounds when the sequence against which the regret is measured is sparse. When the trajectory is 
restricted to the corners of the simplex, we recover, and occasionally improve, the known shifting 
bounds for prediction with expert advice. Besides, our analysis also captures the setting of adap- 
tive regret, a related notion of regret introduced by Hazan and Seshadhri (2009). It was known 
that shifting regret and adaptive regret had some connections but this connection is now seen to be 
even tighter, as both regrets can be viewed as instances of the same alma mater regret, which we 
minimize. Finally, we also show how to dynamically tune the parameters of our algorithms and re- 
view briefly the special case of exp-concave loss functions, exhibiting the first logarithmic shifting 
bounds for exp-concave loss functions on the simplex. 

2. Preliminaries 

We first define the sequential learning framework we work with. Even though our results hold in 
the general setting of online convex optimization, we present them in the somewhat simpler online
linear optimization setup. We point out in Section 6 how these results may be generalized. Online
linear optimization may be cast as a repeated game between the forecaster and the environment as
follows. We use $\Delta_d$ to denote the simplex $\{q \in [0,1]^d : \|q\|_1 = 1\}$.

Online linear optimization. For each round t = 1, . . . , T,

1. Forecaster chooses $p_t = (p_{1,t}, \dots, p_{d,t}) \in \Delta_d$;

2. Environment chooses a loss vector $\ell_t = (\ell_{1,t}, \dots, \ell_{d,t}) \in [0,1]^d$;

3. Forecaster suffers loss $p_t^\top \ell_t$.

The goal of the forecaster is to minimize the accumulated loss $\widehat{L}_T = \sum_{t=1}^T p_t^\top \ell_t$. In the now
classical problem of prediction with expert advice, the goal of the forecaster is to compete with the
best fixed component (often called "expert") chosen in hindsight, that is, with $\min_{i=1,\dots,d} \sum_{t=1}^T \ell_{i,t}$.
The focus of this paper is on more ambitious forecasters that compete with a richer class of
sequences of components. Let [d] = {1, . . . , d}. We use $i_1^T = (i_1, \dots, i_T)$ to denote a sequence in
$[d]^T$ and let $L_T(i_1^T) = \sum_{t=1}^T \ell_{i_t,t}$ be the cumulative linear loss of the sequence $i_1^T \in [d]^T$.
We start by introducing our main algorithmic tool, a generalized share algorithm. It is parametrized
by the "mixing functions" $\psi_t : [0,1]^{d \times t} \to \Delta_d$ for t = 1, . . . , T that assign probabilities to past
"pre-weights" as defined below. In all examples discussed in this paper, these mixing functions are 
quite simple but working with such a general model makes the main ideas more transparent. We 
then provide a simple lemma that serves as the starting point for analyzing different instances of the 
generalized share algorithm. 



Algorithm 1: The generalized share algorithm.

Parameters: learning rate $\eta > 0$ and mixing functions $\psi_t$ for t = 1, . . . , T
Initialization: $p_1 = v_1 = (1/d, \dots, 1/d)$

For each round t = 1, . . . , T,

1. Predict $p_t$;

2. Observe loss $\ell_t \in [0,1]^d$;

3. [loss update] For each j = 1, . . . , d define

   $v_{j,t+1} = \dfrac{p_{j,t}\, e^{-\eta \ell_{j,t}}}{\sum_{i=1}^d p_{i,t}\, e^{-\eta \ell_{i,t}}}$, the current pre-weights,

   $V_{t+1} = [v_{i,s}]_{1 \le i \le d,\ 1 \le s \le t+1}$, the $d \times (t+1)$ matrix of all past and current pre-weights;

4. [shared update] Define $p_{t+1} = \psi_{t+1}(V_{t+1})$.
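
The four steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`loss_update`, `generalized_share`, `no_sharing`) are hypothetical, and the matrix of pre-weights is stored as a list of columns $v_1, v_2, \dots$ rather than as a $d \times (t+1)$ array.

```python
import math

def loss_update(p, loss, eta):
    """Step 3: exponential-weights update producing the pre-weights v_{t+1}."""
    unnormalized = [pj * math.exp(-eta * lj) for pj, lj in zip(p, loss)]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]

def generalized_share(losses, eta, mixing):
    """Run the generalized share scheme on a list of loss vectors.

    `mixing` plays the role of the mixing functions psi_t: it receives the
    list of all pre-weight vectors seen so far and returns p_{t+1}.
    """
    d = len(losses[0])
    p = [1.0 / d] * d
    history = [p[:]]                  # v_1 = p_1 = uniform
    predictions = []
    for loss in losses:
        predictions.append(p[:])      # step 1: predict p_t
        v = loss_update(p, loss, eta) # steps 2-3: observe loss, loss update
        history.append(v)
        p = mixing(history)           # step 4: shared update
    return predictions

def no_sharing(history):
    """Simplest mixing function: p_{t+1} = v_{t+1} (plain EWA, no sharing)."""
    return history[-1][:]
```

With `no_sharing` the scheme reduces to the exponentially weighted average forecaster; the updates studied in Section 3 correspond to other choices of `mixing`.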



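Lemma 1 For all $t \ge 1$ and for all $q_t \in \Delta_d$, Algorithm 1 satisfies
\[ (p_t - q_t)^\top \ell_t \le \frac{1}{\eta} \sum_{i=1}^d q_{i,t} \ln \frac{v_{i,t+1}}{p_{i,t}} + \frac{\eta}{8} . \]

Proof By Hoeffding's inequality,
\[ \sum_{j=1}^d p_{j,t}\, \ell_{j,t} \le -\frac{1}{\eta} \ln\Bigl( \sum_{j=1}^d p_{j,t}\, e^{-\eta \ell_{j,t}} \Bigr) + \frac{\eta}{8} . \tag{1} \]
By definition of $v_{i,t+1}$, for all i = 1, . . . , d we then have
\[ \sum_{j=1}^d p_{j,t}\, e^{-\eta \ell_{j,t}} = \frac{p_{i,t}\, e^{-\eta \ell_{i,t}}}{v_{i,t+1}} , \]
which entails
\[ p_t^\top \ell_t \le \ell_{i,t} + \frac{1}{\eta} \ln \frac{v_{i,t+1}}{p_{i,t}} + \frac{\eta}{8} . \]
The proof is concluded by taking a convex aggregation with respect to $q_t$. ■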
In this section we prove shifting regret bounds for the generalized share algorithm. We compare
the cumulative loss $\sum_{t=1}^T p_t^\top \ell_t$ of the forecaster with the loss of an arbitrary sequence of
vectors $q_1, \dots, q_T$ in the simplex $\Delta_d$, that is, with $\sum_{t=1}^T q_t^\top \ell_t$. The bounds we obtain depend,
of course, on the "regularity" of the comparison sequence. In the now classical results on tracking
the best expert (as in Herbster and Warmuth 1998; Vovk 1999; Herbster and Warmuth 2001;
Bousquet and Warmuth 2002), this regularity is measured as the number of times $q_t \ne q_{t+1}$
(henceforth referred to as "hard shifts"). The main results of this paper show not only that these results
may be generalized to obtain bounds in terms of "softer" regularity measures but that the same 
algorithms that were proposed with hard shift tracking in mind achieve such, perhaps surprisingly 
good, performance. Building on the general formulation introduced in Section 2, we derive such 
regret bounds for the fixed-share algorithm of Herbster and Warmuth (1998) and for the algorithms 
of Bousquet and Warmuth (2002). 

In fact, it is advantageous to extend our analysis so that we not only compare the performance of
the forecaster with sequences $q_1, \dots, q_T$ taking values in the simplex $\Delta_d$ of probability distributions
but rather against arbitrary sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$ of vectors with non-negative components.
The loss of such a sequence is defined by $\sum_{t=1}^T u_t^\top \ell_t$. For fair comparison, we measure the cumulative
loss of the forecaster by $\sum_{t=1}^T \|u_t\|_1\, p_t^\top \ell_t$. Of course, when $u_t \in \Delta_d$, we recover the original
notion of regret.

The norms $\|u_1\|_1, \dots, \|u_T\|_1$ may be viewed as a sequence of weights that give more or less
importance to the instantaneous loss suffered at each step. Of particular interest is the case when
$\|u_t\|_1 \in [0,1]$ which is the setting of "time selection functions" (see Blum and Mansour 2007, Section 6).
In particular, considering sequences $\|u_t\|_1 \in \{0,1\}$ that include the zero vector will provide
us a simple way of deriving "adaptive" regret bounds, a notion introduced by Hazan and Seshadhri
(2009).

The first regret bounds derived below measure the regularity of the sequence $u_1^T = (u_1, \dots, u_T)$
in terms of the quantity
\[ m(u_1^T) = \sum_{t=1}^{T-1} D_{TV}(u_{t+1}, u_t) \tag{2} \]
where for $x = (x_1, \dots, x_d),\ y = (y_1, \dots, y_d) \in \mathbb{R}_+^d$, we define $D_{TV}(x,y) = \sum_{i : x_i \ge y_i} (x_i - y_i)$.
Note that when $x, y \in \Delta_d$, we recover the total variation distance $D_{TV}(x,y) = \frac{1}{2}\|x - y\|_1$, while
for general $x, y \in \mathbb{R}_+^d$, the quantity $D_{TV}(x,y)$ is not necessarily symmetric and is always bounded
by $\|x - y\|_1$. Note that when the vectors $u_t$ are incidence vectors $(0, \dots, 0, 1, 0, \dots, 0) \in \mathbb{R}^d$ of
elements $i_t \in [d]$, then $m(u_1^T)$ corresponds to the number of shifts of the sequence $i_1^T \in [d]^T$, and
we recover from the results stated below the classical bounds for tracking the best expert.
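
As a small illustration of the regularity measure, the following sketch computes $D_{TV}(x,y) = \sum_{i : x_i \ge y_i} (x_i - y_i)$ and the sum of these quantities over consecutive elements of a sequence. The function names `d_tv` and `shift_measure` are hypothetical; vectors are plain Python lists.

```python
def d_tv(x, y):
    """Sum of the positive parts of x_i - y_i (asymmetric outside the simplex)."""
    return sum(xi - yi for xi, yi in zip(x, y) if xi >= yi)

def shift_measure(us):
    """m(u_1^T): sum of D_TV(u_{t+1}, u_t) over consecutive pairs."""
    return sum(d_tv(us[t + 1], us[t]) for t in range(len(us) - 1))

def incidence(i, d):
    """Incidence vector of expert i among d experts."""
    return [1.0 if j == i else 0.0 for j in range(d)]
```

For incidence vectors the measure counts hard shifts: the expert sequence 0, 0, 2, 2, 1 switches twice, and `shift_measure` applied to the corresponding incidence vectors returns 2. For nonnegative vectors outside the simplex the quantity is asymmetric, for instance `d_tv([1, 0], [0, 0])` is 1 while `d_tv([0, 0], [1, 0])` is 0.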

3.1. Fixed-share update 

We now analyze a specific instance of the generalized share algorithm corresponding to the
update
\[ p_{j,t+1} = \sum_{i=1}^d \Bigl( \frac{\alpha}{d} + (1-\alpha)\, 1_{i=j} \Bigr) v_{i,t+1} = \frac{\alpha}{d} + (1-\alpha)\, v_{j,t+1} , \qquad 0 \le \alpha \le 1 . \tag{3} \]
Despite seemingly different statements, this update in Algorithm 1 can be seen to lead exactly to
the fixed-share algorithm of Herbster and Warmuth (1998) for prediction with expert advice. 
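
One round of this fixed-share scheme (loss update followed by mixing with the uniform distribution) can be sketched as follows. The function name `fixed_share_step` is hypothetical, and the sketch assumes the update $p_{j,t+1} = \alpha/d + (1-\alpha) v_{j,t+1}$ with $v_{t+1}$ the normalized exponential-weights pre-weights.

```python
import math

def fixed_share_step(p, loss, eta, alpha):
    """Map p_t and the loss vector of round t to p_{t+1}."""
    d = len(p)
    # Loss update: exponentially reweight and normalize to get pre-weights.
    unnormalized = [pj * math.exp(-eta * lj) for pj, lj in zip(p, loss)]
    total = sum(unnormalized)
    v = [w / total for w in unnormalized]
    # Shared update: every component receives at least alpha / d.
    return [alpha / d + (1.0 - alpha) * vj for vj in v]
```

The lower bound $p_{j,t+1} \ge \alpha/d$ enforced by the last line is the property the analysis relies on: no component can become too small to recover after a shift.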

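Proposition 2 With the above update, for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors
$\ell_t \in [0,1]^d$, and for all $u_1, \dots, u_T \in \mathbb{R}_+^d$,
\[ \sum_{t=1}^T \|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \frac{\|u_1\|_1 \ln d}{\eta} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1 + \frac{m(u_1^T)}{\eta} \ln \frac{d}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha} . \]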

We emphasize that the fixed-share forecaster does not need to "know" anything about the sequence
of the norms $\|u_t\|_1$. Of course, in order to minimize the obtained upper bound, the tuning parameters
$\alpha, \eta$ need to be optimized and their values will depend on the maximal value of $m(u_1^T)$ for the
sequences one wishes to compete against. In particular, we obtain the following corollary, in which
$h(x) = -x \ln x - (1-x) \ln(1-x)$ denotes the binary entropy function for $x \in [0,1]$. We recall (see footnote 1)
that $h(x) \le x \ln(e/x)$ for $x \in [0,1]$.

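Corollary 3 Suppose Algorithm 1 is run with the update (3). Let $m_0 > 0$. For all $T \ge 1$, for all
sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $q_1, \dots, q_T \in \Delta_d$ with $m(q_1^T) \le m_0$,
\[ \sum_{t=1}^T p_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \sqrt{ \frac{T}{2} \Bigl( (m_0 + 1) \ln d + (T-1)\, h\Bigl( \frac{m_0}{T-1} \Bigr) \Bigr) } , \]
whenever $\eta$ and $\alpha$ are optimally chosen in terms of $m_0$ and T.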

If we only consider vectors of the form $q_t = (0, \dots, 0, 1, 0, \dots, 0)$ then $m(q_1^T)$ corresponds to the
number of times $q_{t+1} \ne q_t$ in the sequence $q_1^T$. We thus recover Herbster and Warmuth (1998, Theorem 1)
and Bousquet and Warmuth (2002, Lemma 6) from the much more general Proposition 2.
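
The tuning behind Corollary 3 can be checked numerically. The sketch below evaluates the bound of Proposition 2 for comparison sequences in the simplex (so that every norm equals 1) and plugs in one natural choice of parameters, $\alpha = m_0/(T-1)$ and $\eta = \sqrt{8A/T}$ with $A = (m_0+1)\ln d + (T-1)\,h(m_0/(T-1))$. These explicit values are an assumption obtained by minimizing the bound (they are not stated in the text), they require $0 < m_0 < T-1$, and all function names are hypothetical.

```python
import math

def binary_entropy(x):
    """h(x) = -x ln x - (1 - x) ln(1 - x), with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def fixed_share_bound(T, d, m, eta, alpha):
    """Right-hand side of Proposition 2 when all u_t lie in the simplex."""
    return (math.log(d) / eta
            + eta * T / 8.0
            + m / eta * math.log(d / alpha)
            + (T - m - 1) / eta * math.log(1.0 / (1.0 - alpha)))

def tuned_parameters(T, d, m0):
    """Assumed tuning: alpha = m0 / (T - 1), eta balancing the two terms."""
    alpha = m0 / (T - 1)
    A = (m0 + 1) * math.log(d) + (T - 1) * binary_entropy(alpha)
    eta = math.sqrt(8.0 * A / T)
    return eta, alpha
```

With these values and $m = m_0$, `fixed_share_bound` coincides with the closed form $\sqrt{(T/2)\,A}$ of Corollary 3.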

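Proof of Proposition 2 Applying Lemma 1 with $q_t = u_t / \|u_t\|_1$, and multiplying by $\|u_t\|_1$, we
get for all $t \ge 1$ and $u_t \in \mathbb{R}_+^d$
\[ \|u_t\|_1\, p_t^\top \ell_t - u_t^\top \ell_t \le \frac{1}{\eta} \sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{p_{i,t}} + \frac{\eta}{8} \|u_t\|_1 . \tag{4} \]
We now examine
\[ \sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{p_{i,t}} = \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr) + \sum_{i=1}^d \Bigl( u_{i,t-1} \ln \frac{1}{v_{i,t}} - u_{i,t} \ln \frac{1}{v_{i,t+1}} \Bigr) . \tag{5} \]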



For the first term on the right-hand side, we have
\[ \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr)
 = \sum_{i : u_{i,t} \ge u_{i,t-1}} \Bigl( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{p_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{p_{i,t}} \Bigr)
 + \sum_{i : u_{i,t} < u_{i,t-1}} \Bigl( \underbrace{(u_{i,t} - u_{i,t-1}) \ln \frac{1}{v_{i,t}}}_{\le 0} + u_{i,t} \ln \frac{v_{i,t}}{p_{i,t}} \Bigr) . \tag{6} \]

1. As can be seen by noting that $\ln(1/(1-x)) < x/(1-x)$.



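In view of the update (3), we have $1/p_{i,t} \le d/\alpha$ and $v_{i,t}/p_{i,t} \le 1/(1-\alpha)$. Substituting in (6), we
get
\[ \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr)
 \le \sum_{i : u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{d}{\alpha}
 + \Bigl( \sum_{i : u_{i,t} \ge u_{i,t-1}} u_{i,t-1} + \sum_{i : u_{i,t} < u_{i,t-1}} u_{i,t} \Bigr) \ln \frac{1}{1-\alpha} \]
\[ = D_{TV}(u_t, u_{t-1}) \ln \frac{d}{\alpha}
 + \underbrace{\Bigl( \sum_{i=1}^d u_{i,t} - \sum_{i : u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \Bigr)}_{= \|u_t\|_1 - D_{TV}(u_t, u_{t-1})} \ln \frac{1}{1-\alpha} . \]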

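The sum of the second term in (5) telescopes. Substituting the obtained bounds in the first sum of
the right-hand side in (5), and summing over t = 2, . . . , T, leads to
\[ \sum_{t=2}^T \sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{p_{i,t}} \le m(u_1^T) \ln \frac{d}{\alpha} + \Bigl( \sum_{t=2}^T \|u_t\|_1 - m(u_1^T) \Bigr) \ln \frac{1}{1-\alpha}
 + \sum_{i=1}^d u_{i,1} \ln \frac{1}{v_{i,2}} - \sum_{i=1}^d u_{i,T} \ln \frac{1}{v_{i,T+1}} . \]
We hence get from (4), which we use in particular for t = 1,
\[ \sum_{t=1}^T \|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \frac{1}{\eta} \sum_{i=1}^d u_{i,1} \ln \frac{1}{p_{i,1}} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1
 + \frac{m(u_1^T)}{\eta} \ln \frac{d}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - 1 - m(u_1^T)}{\eta} \ln \frac{1}{1-\alpha} . \]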

3.2. Sparse sequences: Bousquet-Warmuth updates 

Bousquet and Warmuth (2002) proposed forecasters that are able to efficiently compete with 
the best sequence of experts among all those sequences that only switch a bounded number of 
times and also take a small number of different values. Such "sparse" sequences of experts appear 
naturally in many applications. In this section we show that their algorithms in fact work very well 
in comparison with a much larger class of sequences $u_1, \dots, u_T$ that are "regular" (that is, $m(u_1^T)$,
defined in (2), is small) and "sparse" in the sense that the quantity
\[ n(u_1^T) = \sum_{i=1}^d \max_{t=1,\dots,T} u_{i,t} \]
is small. Note that when $q_t \in \Delta_d$ for all t, then two interesting upper bounds can be provided. First,
denoting the union of the supports of these convex combinations by $S \subseteq [d]$, we have $n(q_1^T) \le |S|$,
the cardinality of S. Also,
\[ n(q_1^T) \le \bigl| \{ q_t,\ t = 1, \dots, T \} \bigr| , \]
the cardinality of the pool of convex combinations. Thus, $n(u_1^T)$ generalizes the notion of sparsity
of Bousquet and Warmuth (2002).
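
The two upper bounds on the sparsity measure can be checked on a toy sequence. The sketch below assumes that the sparsity measure is the sum over coordinates of the largest value taken by that coordinate along the sequence, $n(u_1^T) = \sum_i \max_t u_{i,t}$; the function name `sparsity` is hypothetical.

```python
def sparsity(us):
    """Sum over coordinates of the maximal value along the sequence."""
    d = len(us[0])
    return sum(max(u[i] for u in us) for i in range(d))

# Toy comparison sequence in the simplex with d = 4.
qs = [[0.5, 0.5, 0.0, 0.0],
      [1.0, 0.0, 0.0, 0.0],
      [0.5, 0.5, 0.0, 0.0]]
support = {i for q in qs for i, qi in enumerate(q) if qi > 0}
distinct = {tuple(q) for q in qs}
```

Here the sparsity is 1.5, while the union of the supports and the pool of distinct convex combinations both have cardinality 2, in line with the two bounds.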

Here we consider a family of shared updates of the form
\[ p_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, \frac{w_{j,t}}{Z_t} , \qquad 0 \le \alpha \le 1 , \tag{7} \]
where the $w_{j,t}$ are nonnegative weights that may depend on past and current pre-weights and
$Z_t = \sum_{i=1}^d w_{i,t}$ is a normalization constant. Shared updates of this form were proposed by
Bousquet and Warmuth (2002, Sections 3 and 5.2).

Apart from generalizing the regret bounds of Bousquet and Warmuth (2002), we believe that 
the analysis given below is significantly simpler and more transparent. We are also able to slightly 
improve their original bounds. 

We focus on choices of the weights $w_{j,t}$ that satisfy the following conditions: there exists a
constant $C \ge 1$ such that for all j = 1, . . . , d and t = 1, . . . , T,
\[ v_{j,t} \le w_{j,t} \le 1 \quad \text{and} \quad C\, w_{j,t+1} \ge w_{j,t} . \tag{8} \]
The next result improves on Proposition 2 when $T \ll d$ and $n(u_1^T) \ll m(u_1^T)$, that is, when the
dimension (or number of experts) d is large but the sequence $u_1^T$ is sparse.

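Proposition 4 Suppose Algorithm 1 is run with the shared update (7) with weights satisfying the
conditions (8). Then for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for
all sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$,
\[ \sum_{t=1}^T \|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \frac{n(u_1^T) \ln d}{\eta} + \frac{n(u_1^T)\, T \ln C}{\eta} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1
 + \frac{m(u_1^T)}{\eta} \ln \frac{\max_{t \le T} Z_t}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha} . \]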

Proof The beginning and the end of the proof are similar to the one of Proposition 2, as they do not
depend on the specific weight update. In particular, inequalities (4) and (5) remain the same. The
proof is modified after (6), which this time we upper bound using the first condition in (8),
\[ \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr)
 = \sum_{i : u_{i,t} \ge u_{i,t-1}} \Bigl( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{p_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{p_{i,t}} \Bigr)
 + \sum_{i : u_{i,t} < u_{i,t-1}} \Bigl( \underbrace{(u_{i,t} - u_{i,t-1})}_{\le 0}\ \underbrace{\ln \frac{1}{v_{i,t}}}_{\ge \ln(1/w_{i,t})} + u_{i,t} \ln \frac{v_{i,t}}{p_{i,t}} \Bigr) . \tag{9} \]
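By definition of the shared update (7), we have $1/p_{i,t} \le Z_t / (\alpha\, w_{i,t})$ and $v_{i,t}/p_{i,t} \le 1/(1-\alpha)$. We
then upper bound the quantity at hand in (9) by
\[ \sum_{i : u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{Z_t}{\alpha\, w_{i,t}}
 + \Bigl( \sum_{i : u_{i,t} \ge u_{i,t-1}} u_{i,t-1} + \sum_{i : u_{i,t} < u_{i,t-1}} u_{i,t} \Bigr) \ln \frac{1}{1-\alpha}
 + \sum_{i : u_{i,t} < u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}} \]
\[ = D_{TV}(u_t, u_{t-1}) \ln \frac{Z_t}{\alpha} + \bigl( \|u_t\|_1 - D_{TV}(u_t, u_{t-1}) \bigr) \ln \frac{1}{1-\alpha} + \sum_{i=1}^d (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}} . \]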



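Proceeding as in the end of the proof of Proposition 2, we then get the claimed bound, provided that
we can show that
\[ \sum_{t=2}^T \sum_{i=1}^d (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}} \le n(u_1^T) \bigl( \ln d + T \ln C \bigr) - \|u_1\|_1 \ln d , \]
which we do next. Indeed, the left-hand side can be rewritten as
\[ \sum_{t=2}^T \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{w_{i,t}} - u_{i,t} \ln \frac{1}{w_{i,t+1}} \Bigr) + \sum_{t=2}^T \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{w_{i,t+1}} - u_{i,t-1} \ln \frac{1}{w_{i,t}} \Bigr) \]
\[ \le \Bigl( \sum_{t=2}^T \sum_{i=1}^d u_{i,t} \ln \frac{C\, w_{i,t+1}}{w_{i,t}} \Bigr) + \Bigl( \sum_{i=1}^d u_{i,T} \ln \frac{1}{w_{i,T+1}} - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}} \Bigr) \]
\[ \le \sum_{i=1}^d \Bigl( \max_{t=1,\dots,T} u_{i,t} \Bigr) \sum_{t=2}^T \ln \frac{C\, w_{i,t+1}}{w_{i,t}} + \sum_{i=1}^d \Bigl( \max_{t=1,\dots,T} u_{i,t} \Bigr) \ln \frac{1}{w_{i,T+1}} - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}} \]
\[ = \sum_{i=1}^d \Bigl( \max_{t=1,\dots,T} u_{i,t} \Bigr) \Bigl( (T-1) \ln C + \ln \frac{1}{w_{i,2}} \Bigr) - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}} , \]
where we used $C \ge 1$ for the first inequality and the second condition in (8) for the second inequality.
The proof is concluded by noting that (8) entails $w_{i,2} \ge (1/C)\, w_{i,1} \ge (1/C)\, v_{i,1} = 1/(dC)$ and
that the coefficient $\max_{t=1,\dots,T} u_{i,t} - u_{i,1}$ in front of $\ln(1/w_{i,2})$ is nonnegative. ■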

We now generalize Corollaries 8 and 9 of Bousquet and Warmuth (2002) by showing two specific
instances of the generic update (7) that satisfy (8). The first update uses $w_{j,t} = \max_{s \le t} v_{j,s}$. Then (8)
is satisfied with C = 1. Moreover, since a sum of maxima of nonnegative elements is smaller than
the sum of the sums, $Z_t \le \min\{d, t\} \le T$. This immediately gives the following result.

Corollary 5 Suppose Algorithm 1 is run with the update (7) with $w_{j,t} = \max_{s \le t} v_{j,s}$. For all
$T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $q_1, \dots, q_T \in \Delta_d$,
\[ \sum_{t=1}^T p_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \frac{n(q_1^T) \ln d}{\eta} + \frac{\eta}{8}\, T + \frac{m(q_1^T)}{\eta} \ln \frac{T}{\alpha} + \frac{T - m(q_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha} . \]
The second update we discuss uses $w_{j,t} = \max_{s \le t} e^{\gamma (s-t)}\, v_{j,s}$ in (7) for some $\gamma > 0$. Both
conditions in (8) are satisfied with $C = e^{\gamma}$. One also has that

Z t ^d and Zt^y^e-^ = - 1 <- 

as $e^x \ge 1 + x$ for all real x. The bound of Proposition 4 then instantiates as
\[ \frac{n(q_1^T) \ln d}{\eta} + \frac{n(q_1^T)\, T \gamma}{\eta} + \frac{\eta}{8}\, T + \frac{m(q_1^T)}{\eta} \ln \frac{\min\{d, 1/\gamma\}}{\alpha} + \frac{T - m(q_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha} \]
when sequences $u_t = q_t \in \Delta_d$ are considered. This bound is best understood when $\gamma$ is tuned optimally
based on T and on two bounds $m_0$ and $n_0$ over the quantities $m(q_1^T)$ and $n(q_1^T)$. Indeed, by
optimizing $n_0 T \gamma + m_0 \ln(1/\gamma)$, i.e., by choosing $\gamma = m_0/(n_0 T)$, one gets a bound that improves
on the one of the previous corollary:

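Corollary 6 Let $m_0, n_0 > 0$. Suppose Algorithm 1 is run with the update $w_{j,t} = \max_{s \le t} e^{\gamma (s-t)}\, v_{j,s}$
where $\gamma = m_0/(n_0 T)$. For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and
for all $q_1, \dots, q_T \in \Delta_d$ such that $m(q_1^T) \le m_0$ and $n(q_1^T) \le n_0$, we have
\[ \sum_{t=1}^T p_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \frac{n_0 \ln d}{\eta} + \frac{m_0}{\eta} \Bigl( 1 + \ln \min\Bigl\{ d,\ \frac{n_0 T}{m_0} \Bigr\} \Bigr)
 + \frac{\eta T}{8} + \frac{m_0}{\eta} \ln \frac{1}{\alpha} + \frac{T - m_0 - 1}{\eta} \ln \frac{1}{1-\alpha} . \]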

As the factors $e^{-\gamma t}$ cancel out in the numerator and denominator of the ratio in (7), there is a
straightforward implementation of the algorithm (not requiring the knowledge of T) that needs to
maintain only d weights.
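
One way to maintain only d weights, sketched below under the assumption that the weights are the discounted maxima $w_{j,t} = \max_{s \le t} e^{\gamma(s-t)} v_{j,s}$, is the recursion $w_{j,t} = \max\{ v_{j,t},\ e^{-\gamma} w_{j,t-1} \}$ started from $w_{j,1} = v_{j,1}$. The function names are hypothetical, and `naive_weights` is included only to compare the recursion with the definition.

```python
import math

def discounted_max_weights(prev_w, v, gamma):
    """Recursive form: w_{j,t} = max(v_{j,t}, exp(-gamma) * w_{j,t-1})."""
    decay = math.exp(-gamma)
    return [max(vj, decay * wj) for vj, wj in zip(v, prev_w)]

def sparse_share(v, w, alpha):
    """Shared update of the form (1 - alpha) * v_j + alpha * w_j / Z."""
    z = sum(w)
    return [(1.0 - alpha) * vj + alpha * wj / z for vj, wj in zip(v, w)]

def naive_weights(history, gamma):
    """Direct evaluation of max over s <= t of exp(gamma * (s - t)) * v_{j,s}."""
    last = len(history) - 1
    d = len(history[0])
    return [max(math.exp(gamma * (s - last)) * history[s][j]
                for s in range(len(history)))
            for j in range(d)]
```

The recursive form stores one number per component, whereas the naive form needs the whole history of pre-weights; both give the same weights up to rounding.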

In contrast, the corresponding algorithm of Bousquet and Warmuth (2002), using the updates
$p_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, S_t^{-1} \sum_{s \le t-1} (t-s)^{-1} v_{j,s}$ or $p_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, S_t^{-1} \max_{s \le t-1} (t-s)^{-1} v_{j,s}$,
where $S_t$ denote normalization factors, needs to maintain O(dT) weights with a naive implementation,
and O(d ln T) weights with a more sophisticated one. In addition, the obtained bounds are
slightly worse than the one stated above in Corollary 6 as an additional factor of $m_0 \ln(1 + \ln T)$ is
present in Bousquet and Warmuth (2002, Corollary 9).

4. Adaptive regret 

Next we show how the results of the previous section, e.g., Proposition 2, imply guarantees
in terms of adaptive regret, a notion introduced by Hazan and Seshadhri (2009) as follows. For
$\tau_0 \in \{1, \dots, T\}$, the $\tau_0$-adaptive regret of a forecaster is defined by
\[ R_T^{\tau_0\text{-adapt}} = \max_{\substack{[r,s] \subset [1,T] \\ s + 1 - r \le \tau_0}} \Bigl\{ \sum_{t=r}^s p_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \Bigr\} . \tag{10} \]

Adaptive regret is an alternative way to measure the performance of a forecaster against a changing
environment. It is a straightforward observation that adaptive regret bounds also lead to shifting
regret bounds (in terms of hard shifts). Here we show that these two notions of regret share an even
tighter connection, as they can be both viewed as instances of the same alma mater bound, e.g.,
Proposition 2.

Hazan and Seshadhri (2009) essentially considered the case of online convex optimization with 
exp-concave loss function (see Section 6 below). In case of general convex functions, they also 
mentioned that the greedy projection forecaster of Zinkevich (2003), i.e., mirror descent with a
quadratic regularizer, enjoys adaptive regret guarantees. This forecaster can be implemented on
the simplex in time O(d); see, e.g., Duchi et al. (2008). We now show that the simpler fixed-share
algorithm has a similar adaptive regret bound. 

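Proposition 7 Suppose that Algorithm 1 is run with the shared update (3). Then for all $T \ge 1$, for
all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $\tau_0 \in \{1, \dots, T\}$,
\[ R_T^{\tau_0\text{-adapt}} \le \frac{1}{\eta} \ln \frac{d}{\alpha} + \frac{\tau_0 - 1}{\eta} \ln \frac{1}{1-\alpha} + \frac{\eta}{8}\, \tau_0 . \]
In particular, when $\eta$ and $\alpha$ are chosen optimally (depending on $\tau_0$ and T)
\[ R_T^{\tau_0\text{-adapt}} \le \sqrt{ \frac{\tau_0}{2} \Bigl( \tau_0\, h\Bigl( \frac{1}{\tau_0} \Bigr) + \ln d \Bigr) } \le \sqrt{ \frac{\tau_0}{2} \ln(e\, d\, \tau_0) } . \]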

Proof For $1 \le r \le s \le T$ and $q \in \Delta_d$, the regret in the right-hand side of (10) equals the
regret considered in Proposition 2 against the sequence $u_1^T$ defined as $u_t = q$ for t = r, . . . , s and
$u_t = (0, \dots, 0)$ for the remaining t. When $r \ge 2$, this sequence is such that $D_{TV}(u_r, u_{r-1}) =
D_{TV}(q, 0) = 1$ and $D_{TV}(u_{s+1}, u_s) = D_{TV}(0, q) = 0$ so that $m(u_1^T) = 1$, while $\|u_1\|_1 = 0$.
When r = 1, we have $\|u_1\|_1 = 1$ and $m(u_1^T) = 0$. In all cases, $m(u_1^T) + \|u_1\|_1 = 1$. Specializing
the bound of Proposition 2 to the thus defined sequence $u_1^T$ gives the result. ■
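
The identity $m(u_1^T) + \|u_1\|_1 = 1$ used in this proof can be verified exhaustively on a small example. The sketch below builds the comparison sequence that equals q on the interval [r, s] and the zero vector elsewhere; the helper names are hypothetical, and `d_tv` follows the definition $D_{TV}(x,y) = \sum_{i : x_i \ge y_i} (x_i - y_i)$.

```python
def d_tv(x, y):
    """Sum of the positive parts of x_i - y_i."""
    return sum(xi - yi for xi, yi in zip(x, y) if xi >= yi)

def shift_measure(us):
    """Sum of D_TV(u_{t+1}, u_t) over consecutive pairs."""
    return sum(d_tv(us[t + 1], us[t]) for t in range(len(us) - 1))

def interval_sequence(q, r, s, T):
    """u_t = q for r <= t <= s (1-indexed rounds) and the zero vector otherwise."""
    zero = [0.0] * len(q)
    return [list(q) if r <= t <= s else zero for t in range(1, T + 1)]

def check_identity(q, T):
    """True if m(u) + ||u_1||_1 = 1 for every interval [r, s] in [1, T]."""
    for r in range(1, T + 1):
        for s in range(r, T + 1):
            us = interval_sequence(q, r, s, T)
            if shift_measure(us) + sum(us[0]) != 1.0:
                return False
    return True
```

Entering the interval costs $D_{TV}(q, 0) = 1$, while leaving it costs $D_{TV}(0, q) = 0$; this asymmetry is why a single "shift" suffices for each interval.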



5. Online tuning of the parameters 

The forecasters studied above need their parameters $\eta$ and $\alpha$ to be tuned according to various
quantities, including the time horizon T. We show here how the trick of Auer et al. (2002) of having
these parameters vary over time can be extended to our setting. For the sake of concreteness we
focus on the fixed-share update, i.e., Algorithm 1 run with the update (3). We respectively replace
steps 3 and 4 of its description by the loss and shared updates
\[ v_{j,t+1} = \frac{ p_{j,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} }{ \sum_{i=1}^d p_{i,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{i,t}} } \quad \text{and} \quad p_{j,t+1} = \frac{\alpha_{t+1}}{d} + (1 - \alpha_{t+1})\, v_{j,t+1} , \tag{11} \]
for all $t \ge 1$ and all $j \in [d]$, where $(\eta_\tau)$ and $(\alpha_\tau)$ are two sequences of positive numbers, indexed
by $\tau \ge 1$. We also conventionally define $\eta_0 = \eta_1$. Proposition 2 is then adapted in the following
way (when $\eta_t \equiv \eta$ and $\alpha_t \equiv \alpha$, Proposition 2 is exactly recovered).
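Proposition 8 The forecaster based on the above updates (11) is such that whenever $\eta_t \le \eta_{t-1}$
and $\alpha_t \le \alpha_{t-1}$ for all $t \ge 1$, the following performance bound is achieved. For all $T \ge 1$, for all
sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $u_1, \dots, u_T \in \mathbb{R}_+^d$,
\[ \sum_{t=1}^T \|u_t\|_1\, p_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \Bigl( \frac{\|u_1\|_1}{\eta_1} + \sum_{t=2}^T \|u_t\|_1 \Bigl( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Bigr) \Bigr) \ln d \]
\[ \qquad + \frac{m(u_1^T)}{\eta_T} \ln \frac{d (1 - \alpha_T)}{\alpha_T} + \sum_{t=2}^T \frac{\|u_t\|_1}{\eta_{t-1}} \ln \frac{1}{1 - \alpha_t} + \sum_{t=1}^T \frac{\eta_{t-1}}{8} \|u_t\|_1 . \]

Due to space constraints, we only instantiate the obtained bound to the case of T-adaptive regret
guarantees, when T is unknown and/or can increase without bounds.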



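Corollary 9 The forecaster based on the above updates with $\eta_t = \sqrt{\ln(dt)/t}$ for $t \ge 3$ and
$\eta_0 = \eta_1 = \eta_2 = \eta_3$ on the one hand, $\alpha_t = 1/t$ on the other hand, is such that for all $T \ge 3$ and for
all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$,
\[ \max_{[r,s] \subset [1,T]} \Bigl\{ \sum_{t=r}^s p_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \Bigr\} \le \sqrt{2 T \ln(dT)} + \sqrt{3 \ln(3d)} . \]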
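
A horizon-free variant of fixed-share along these lines can be sketched as follows. The sketch makes several assumptions that should be read as such: the schedules are $\eta_t = \sqrt{\ln(dt)/t}$ for $t \ge 3$ (constant before) and $\alpha_t = 1/t$; the loss update raises $p_{j,t}$ to the power $\eta_t/\eta_{t-1}$ before the exponential reweighting; and the mixing step producing $p_{t+1}$ uses $\alpha_{t+1}$. The function names are hypothetical.

```python
import math

def eta_schedule(t, d):
    """eta_t = sqrt(ln(d t) / t) for t >= 3, and eta_0 = eta_1 = eta_2 = eta_3."""
    tt = max(t, 3)
    return math.sqrt(math.log(d * tt) / tt)

def alpha_schedule(t):
    """alpha_t = 1 / t."""
    return 1.0 / t

def anytime_fixed_share(losses):
    """Fixed-share with time-varying parameters; no knowledge of T is needed."""
    d = len(losses[0])
    p = [1.0 / d] * d
    predictions = []
    for t, loss in enumerate(losses, start=1):
        predictions.append(p[:])
        eta_prev, eta_t = eta_schedule(t - 1, d), eta_schedule(t, d)
        ratio = eta_t / eta_prev
        # Loss update with the learning-rate correction p^(eta_t / eta_{t-1}).
        unnormalized = [pj ** ratio * math.exp(-eta_t * lj)
                        for pj, lj in zip(p, loss)]
        total = sum(unnormalized)
        v = [w / total for w in unnormalized]
        # Shared update with the mixing rate of the next round (assumed index).
        a = alpha_schedule(t + 1)
        p = [a / d + (1.0 - a) * vj for vj in v]
    return predictions
```

Both schedules are non-increasing, which is the condition required by Proposition 8.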

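Proof The sequence $n \mapsto \ln(n)/n$ is only non-increasing after round $n \ge 3$, so that the defined
sequences of $(\alpha_t)$ and $(\eta_t)$ are non-increasing, as desired. For a given pair (r, s) and a given $q \in \Delta_d$,
we consider the sequence defined in the proof of Proposition 7; it satisfies that $m(u_1^T) \le 1$ and
$\|u_t\|_1 \le 1$ for all $t \ge 1$. Therefore, Proposition 8 ensures that
\[ \sum_{t=r}^s p_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \le \frac{\ln d}{\eta_T} + \frac{1}{\eta_T} \ln \frac{d (1 - \alpha_T)}{\alpha_T}
 + \underbrace{\sum_{t=2}^T \frac{1}{\eta_{t-1}} \ln \frac{1}{1 - \alpha_t}}_{\le (1/\eta_T) \sum_{t=2}^T \ln(t/(t-1)) = (\ln T)/\eta_T} + \sum_{t=1}^T \frac{\eta_{t-1}}{8} . \]
It only remains to substitute the proposed values of $\eta_t$ and to note that
\[ \sum_{t=1}^T \eta_{t-1} \le 3 \eta_3 + \sum_{t=3}^{T-1} \frac{1}{\sqrt{t}} \sqrt{\ln(dT)} \le 3 \sqrt{\frac{\ln(3d)}{3}} + 2 \sqrt{T} \sqrt{\ln(dT)} . \]
■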



6. Online convex optimization and exp-concave loss functions 

By using a standard reduction, the results of the previous sections can be applied to online
convex optimization on the simplex. In this setting, at each step t the forecaster chooses $p_t \in \Delta_d$
and then is given access to a convex loss $\ell_t : \Delta_d \to [0,1]$. Now, using Algorithm 1 with the
loss vector $\boldsymbol{\ell}_t \in \partial \ell_t(p_t)$ given by a subgradient of $\ell_t$ leads to the desired bounds. Indeed, by the
convexity of $\ell_t$, the regret at each time t with respect to any vector $u_t \in \mathbb{R}_+^d$ with $\|u_t\|_1 > 0$ is then
bounded as
\[ \|u_t\|_1 \Bigl( \ell_t(p_t) - \ell_t\Bigl( \frac{u_t}{\|u_t\|_1} \Bigr) \Bigr) \le \bigl( \|u_t\|_1\, p_t - u_t \bigr)^\top \boldsymbol{\ell}_t . \]






6.1. Exp-concave loss functions 

Recall that a loss function $\ell_t$ is called $\eta_0$-exp-concave if $e^{-\eta_0 \ell_t}$ is concave. (In particular,
exp-concavity implies convexity.) Bousquet and Warmuth (2002) study shifting regret for exp-concave
loss functions. However, they define the regret of an element $q_1^T$ of the comparison class (a sequence
of elements in $\Delta_d$) by
\[ \sum_{t=1}^T \bigl( \ell_t(p_t) - q_t^\top \boldsymbol{\ell}_t \bigr) \tag{12} \]
where $\boldsymbol{\ell}_t = (\ell_t(e_1), \dots, \ell_t(e_d))$ and $e_1, \dots, e_d$ are the elements of the canonical basis of $\mathbb{R}^d$.
This corresponds to the linear optimization case studied in the previous sections. However, due to
exp-concavity, (1) can be replaced by an application of Jensen's inequality, namely,
\[ \ell_t(p_t) \le -\frac{1}{\eta_0} \ln\Bigl( \sum_{j=1}^d p_{j,t}\, e^{-\eta_0 \ell_t(e_j)} \Bigr) . \]
Hence the various propositions and corollaries of Sections 3 and 4 still hold true for the regret (12)
up to some modifications (deletion of the terms linear in $\eta$, assumption of exp-concavity,
boundedness no longer needed). For the sake of concreteness, we illustrate the required modifications on
Proposition 4.

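Proposition 10 Suppose Algorithm 1 is run with the shared update (7) with weights satisfying the
conditions (8) and for the choice $\eta = \eta_0$. Then for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of
$\eta_0$-exp-concave loss functions, and for all sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$,
\[ \sum_{t=1}^T \|u_t\|_1\, \ell_t(p_t) - \sum_{t=1}^T u_t^\top \boldsymbol{\ell}_t \le \frac{n(u_1^T) \ln d}{\eta_0} + \frac{n(u_1^T)\, T \ln C}{\eta_0}
 + \frac{m(u_1^T)}{\eta_0} \ln \frac{\max_{t \le T} Z_t}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta_0} \ln \frac{1}{1-\alpha} . \]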

We now turn to the more ambitious goal of controlling regrets of the form $\sum_{t=1}^T \bigl( \ell_t(p_t) - \ell_t(q_t) \bigr)$
where losses $\ell_t$ are exp-concave. Hazan and Seshadhri (2009) constructed algorithms with
T-adaptive regret of the order of $O(\ln^2 T)$ and running in time poly(d, log T). They also constructed
different algorithms with T-adaptive regret bounded by O(ln T) and running time poly(d, T).

Next, we show the first logarithmic shifting bounds for exp-concave loss functions. However,
we only do so against sequences $q_1^T$ of elements in $\Delta_d$, i.e., we offer here no general bound in terms
of linear vectors $u_1^T$ that would unify here as well the view between tracking bounds and adaptive
regret bounds. Besides, we get shifting bounds only in terms of hard shifts
\[ s(q_1^T) = \bigl| \{ t = 2, \dots, T : q_t \ne q_{t-1} \} \bigr| . \]
Obviously, getting unifying bounds in terms of soft shifts of sequences $u_1^T$ of linear vectors is an
important open question, which we leave for future research. To get our bound, we mix ideas of
Herbster and Warmuth (1998) and Blum and Kalai (1997). We define a prior over the sequences of
convex weight vectors as the distribution of the following homogeneous Markov chain $Q_1, Q_2, \dots$:
The starting vector $Q_1$ is drawn at random according to the uniform distribution $\mu$ over $\Delta_d$. Then,
given $Q_{t-1}$, the next element $Q_t$ is equal to $Q_{t-1}$ with probability $1 - \alpha$ and with probability $\alpha$ is
drawn at random according to $\mu$. In the sequel, all probabilities P and expectations E will be with
respect to this Markov chain. Now, the convex weight vector used at time $t \ge 1$ by the forecaster is



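\[ p_t = \frac{ E\bigl[ Q_t\, e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \bigr] }{ E\bigl[ e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \bigr] } \qquad \text{where} \qquad L_{t-1}(Q_1^{t-1}) = \sum_{s=1}^{t-1} \ell_s(Q_s) \tag{13} \]
(with the convention that an empty sum is null). For this forecaster, we get the following
performance bound, whose proof can be found in appendix.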
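
To make the prior concrete, the following sketch draws one path from a Markov chain of this kind: the first vector is uniform on the simplex, and each subsequent vector either repeats the previous one (probability $1 - \alpha$) or is redrawn uniformly (probability $\alpha$). It relies on the standard fact that normalizing independent exponential variables yields a uniform draw from the simplex; the function names are hypothetical, and the sketch only samples from the prior, it does not compute the forecaster (13).

```python
import random

def uniform_simplex(d, rng):
    """Uniform draw from the simplex via normalized i.i.d. exponentials."""
    e = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(e)
    return [x / total for x in e]

def sample_prior_path(T, d, alpha, rng):
    """One path Q_1, ..., Q_T of the sticky Markov chain prior."""
    q = uniform_simplex(d, rng)
    path = [q]
    for _ in range(T - 1):
        if rng.random() < alpha:      # shift: redraw from the uniform prior
            q = uniform_simplex(d, rng)
        path.append(q)                # otherwise stay at the previous vector
    return path
```

Averaging $Q_t$ over many such paths with weights $e^{-\eta_0 L_{t-1}}$ would give a Monte Carlo approximation of the ratio of expectations in (13), at a computational cost that this sketch does not try to control.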

Proposition 11 For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of $\eta_0$-exp-concave loss functions
taking values in [0, L], the cumulative loss of the above forecaster is bounded for all sequences
$q_1, \dots, q_T \in \Delta_d$ by



Ew-Ew< WH '" J '" -(i. m 



erjLT 



7] 



rj a 



(a(gf) + l)(d-l) 



In 



Under the imposition of a bound $s_0$ on the numbers of hard shifts $s(q_1^T)$ and up to a tuning of $\alpha$ in
terms of $s_0$ and T, the last two terms of the bound are smaller than $T\, h(s_0/T) \le s_0 \ln(eT/s_0)$ and
therefore, the whole regret bound is $O\bigl( (d\, s_0/\eta_0) \ln T \bigr)$.
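Appendix A. Proof of Proposition 8

We first adapt Lemma 1.

Lemma 12 The forecaster based on the loss and shared updates (11) satisfies, for all $t \ge 1$ and for
all $q_t \in \Delta_d$,
\[ (p_t - q_t)^\top \ell_t \le \sum_{i=1}^d q_{i,t} \Bigl( \frac{1}{\eta_{t-1}} \ln \frac{1}{p_{i,t}} - \frac{1}{\eta_t} \ln \frac{1}{v_{i,t+1}} \Bigr) + \Bigl( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Bigr) \ln d + \frac{\eta_{t-1}}{8} \]
whenever $\eta_t \le \eta_{t-1}$.

Proof By Hoeffding's inequality,
\[ \sum_{j=1}^d p_{j,t}\, \ell_{j,t} \le -\frac{1}{\eta_{t-1}} \ln\Bigl( \sum_{j=1}^d p_{j,t}\, e^{-\eta_{t-1} \ell_{j,t}} \Bigr) + \frac{\eta_{t-1}}{8} . \]
By Jensen's inequality, since $\eta_t \le \eta_{t-1}$ and thus $x \mapsto x^{\eta_{t-1}/\eta_t}$ is convex,
\[ \frac{1}{d} \sum_{j=1}^d p_{j,t}\, e^{-\eta_{t-1} \ell_{j,t}} = \frac{1}{d} \sum_{j=1}^d \Bigl( p_{j,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Bigr)^{\eta_{t-1}/\eta_t} \ge \Bigl( \frac{1}{d} \sum_{j=1}^d p_{j,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Bigr)^{\eta_{t-1}/\eta_t} . \]
Substituting in Hoeffding's bound we get
\[ p_t^\top \ell_t \le -\frac{1}{\eta_t} \ln\Bigl( \sum_{j=1}^d p_{j,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Bigr) + \Bigl( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Bigr) \ln d + \frac{\eta_{t-1}}{8} . \]
Now, by definition of the loss update in (11), for all $i \in [d]$,
\[ \sum_{j=1}^d p_{j,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} = \frac{1}{v_{i,t+1}}\, p_{i,t}^{\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{i,t}} , \]
which, after substitution in the previous bound leads to the inequality
\[ p_t^\top \ell_t \le \ell_{i,t} + \frac{1}{\eta_{t-1}} \ln \frac{1}{p_{i,t}} - \frac{1}{\eta_t} \ln \frac{1}{v_{i,t+1}} + \Bigl( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Bigr) \ln d + \frac{\eta_{t-1}}{8} , \]
valid for all $i \in [d]$. The proof is concluded by taking a convex aggregation over i with respect to
$q_t$. ■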

The proof of Proposition 8 follows the steps of the one of Proposition 2; we sketch it below. 

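Proof of Proposition 8 Applying Lemma 12 with $q_t = u_t / \|u_t\|_1$, and multiplying by $\|u_t\|_1$, we
get for all $t \ge 1$ and $u_t \in \mathbb{R}_+^d$,
\[ \|u_t\|_1\, p_t^\top \ell_t - u_t^\top \ell_t \le \frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{p_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}}
 + \|u_t\|_1 \Bigl( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Bigr) \ln d + \frac{\eta_{t-1}}{8} \|u_t\|_1 . \tag{14} \]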

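We will sum these bounds over $t \ge 1$ to get the desired result but need to perform first some
additional boundings for $t \ge 2$; in particular, we examine
\[ \frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{p_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}}
 = \frac{1}{\eta_{t-1}} \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr) + \sum_{i=1}^d \Bigl( \frac{u_{i,t-1}}{\eta_{t-1}} \ln \frac{1}{v_{i,t}} - \frac{u_{i,t}}{\eta_t} \ln \frac{1}{v_{i,t+1}} \Bigr) , \tag{15} \]
where the first difference in the right-hand side can be bounded as in (6) by
\[ \sum_{i=1}^d \Bigl( u_{i,t} \ln \frac{1}{p_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \Bigr)
 \le \sum_{i : u_{i,t} \ge u_{i,t-1}} \Bigl( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{p_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{p_{i,t}} \Bigr) + \sum_{i : u_{i,t} < u_{i,t-1}} u_{i,t} \ln \frac{v_{i,t}}{p_{i,t}} \]
\[ \le D_{TV}(u_t, u_{t-1}) \ln \frac{d}{\alpha_t} + \bigl( \|u_t\|_1 - D_{TV}(u_t, u_{t-1}) \bigr) \ln \frac{1}{1 - \alpha_t} \]
\[ \le D_{TV}(u_t, u_{t-1}) \ln \frac{d (1 - \alpha_T)}{\alpha_T} + \|u_t\|_1 \ln \frac{1}{1 - \alpha_t} , \tag{16} \]
where we used for the second inequality that the shared update in (11) is such that $1/p_{i,t} \le d/\alpha_t$ and
$v_{i,t}/p_{i,t} \le 1/(1 - \alpha_t)$, and for the third inequality, that $\alpha_t \ge \alpha_T$ and $x \mapsto (1-x)/x$ is decreasing
on (0, 1]. Summing (15) over t = 2, . . . , T using (16) and the fact that $\eta_t \ge \eta_T$, we get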

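\[ \sum_{t=2}^T \Bigl( \frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{p_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}} \Bigr)
 \le \frac{m(u_1^T)}{\eta_T} \ln \frac{d (1 - \alpha_T)}{\alpha_T} + \sum_{t=2}^T \frac{\|u_t\|_1}{\eta_{t-1}} \ln \frac{1}{1 - \alpha_t}
 + \sum_{i=1}^d \Bigl( \frac{u_{i,1}}{\eta_1} \ln \frac{1}{v_{i,2}} - \frac{u_{i,T}}{\eta_T} \ln \frac{1}{v_{i,T+1}} \Bigr) . \]
An application of (14), including for t = 1, for which we recall that $p_{i,1} = 1/d$ and $\eta_1 = \eta_0$ by
convention, concludes the proof. ■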
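Appendix B. Proof of Proposition 11

Proof By the definition of exp-concavity and by application of Jensen's inequality to the distribution
$P_t$ over $(\Delta_d)^t$ with density
\[ r_1^t \mapsto \frac{ e^{-\eta_0 L_{t-1}(r_1^{t-1})} }{ E\bigl[ e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \bigr] } \]
with respect to the marginal distribution of P over $(\Delta_d)^t$, we have that
\[ \exp\bigl( -\eta_0 \ell_t(p_t) \bigr) = \exp\bigl( -\eta_0 \ell_t( E_t[Q_t] ) \bigr) \ge E_t\bigl[ \exp( -\eta_0 \ell_t(Q_t) ) \bigr] = \frac{ E\bigl[ e^{-\eta_0 L_t(Q_1^t)} \bigr] }{ E\bigl[ e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \bigr] } . \]
Thus, a telescoping sum appears,
\[ \sum_{t=1}^T \ell_t(p_t) = \sum_{t=1}^T -\frac{1}{\eta_0} \ln e^{-\eta_0 \ell_t(p_t)} \le -\frac{1}{\eta_0} \ln E\bigl[ e^{-\eta_0 L_T(Q_1^T)} \bigr] . \]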

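It suffices to lower bound the expectation. To do so, we define for all sequences $r_1^k$ the set of the
sequences of k weight vectors that only shift when $r_1^k$ does and that at each such shift are $\varepsilon$-close
to the corresponding values of the $r_t$:
\[ S_{\varepsilon, r_1^k} = \Bigl\{ s_1^k \in (\Delta_d)^k :\ \forall t \in \{2, \dots, k\},\ s_t \ne s_{t-1} \Rightarrow r_t \ne r_{t-1} \]
\[ \qquad\qquad \text{and } \forall t \in \{1, \dots, k\},\ s_t = (1 - \varepsilon)\, r_t + \varepsilon\, w_t \text{ for some } w_t \in \Delta_d \Bigr\} . \]
Note that the second defining constraint is equivalent to the same constraint only at the shifting times
of $r_1^k$, in view of the first constraint. Since exp-concave loss functions are in particular convex, we
get that for all $s_1^T \in S_{\varepsilon, q_1^T}$,
\[ \sum_{t=1}^T \ell_t(s_t) \le (1 - \varepsilon) \sum_{t=1}^T \ell_t(q_t) + \varepsilon \sum_{t=1}^T \ell_t(w_t) \le \sum_{t=1}^T \ell_t(q_t) + \varepsilon L T . \]
Thus,
\[ -\frac{1}{\eta_0} \ln E\bigl[ e^{-\eta_0 L_T(Q_1^T)} \bigr] \le -\frac{1}{\eta_0} \ln E\bigl[ e^{-\eta_0 L_T(Q_1^T)}\, 1_{\{ Q_1^T \in S_{\varepsilon, q_1^T} \}} \bigr]
 \le \sum_{t=1}^T \ell_t(q_t) + \varepsilon L T - \frac{1}{\eta_0} \ln P\bigl( S_{\varepsilon, q_1^T} \bigr) . \]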

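Furthermore, we show by induction on t that for all $t \ge 1$,
\[ P\bigl( S_{\varepsilon, q_1^t} \bigr) \ge \varepsilon^{d-1}\, (1 - \alpha)^{t - s(q_1^t) - 1}\, \bigl( \alpha\, \varepsilon^{d-1} \bigr)^{s(q_1^t)} . \]
This is true for t = 1 as $S_{\varepsilon, q_1} = (1 - \varepsilon)\, q_1 + \varepsilon \Delta_d$ has a P-probability given by its $\mu$-probability,
which is equal to $\varepsilon^{d-1}$, and as by convention, $s(q_1) = 0$. Besides, when $t \ge 2$, we
have by definition of P (cf. its defining transition probability distributions) and $S_{\varepsilon, q_1^t}$ (cf. the $s_1^t$ can
only shift when the $q_1^t$ do) that
\[ P\bigl( S_{\varepsilon, q_1^t} \bigr) \ge \begin{cases} (1 - \alpha)\, P\bigl( S_{\varepsilon, q_1^{t-1}} \bigr) & \text{when } q_t = q_{t-1}, \\ \alpha\, P\bigl( S_{\varepsilon, q_1^{t-1}} \bigr)\, \mu\bigl( S_{\varepsilon, q_t} \bigr) = \alpha\, \varepsilon^{d-1}\, P\bigl( S_{\varepsilon, q_1^{t-1}} \bigr) & \text{when } q_t \ne q_{t-1}, \end{cases} \]
which concludes the induction.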

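Substituting the obtained bound, we have proved so far that
\[ \sum_{t=1}^T \ell_t(p_t) \le \sum_{t=1}^T \ell_t(q_t) + \varepsilon L T - \frac{1}{\eta_0} \ln\Bigl( \varepsilon^{d-1}\, (1 - \alpha)^{T - s(q_1^T) - 1}\, \bigl( \alpha\, \varepsilon^{d-1} \bigr)^{s(q_1^T)} \Bigr) . \]
$\varepsilon \in [0, 1]$ is a parameter of the analysis, it can be optimized to minimize
\[ \varepsilon L T + \frac{(s(q_1^T) + 1)(d - 1)}{\eta_0} \ln \frac{1}{\varepsilon} \]
and get the claimed bound. This is achieved by choosing
\[ \varepsilon = \min\Bigl\{ 1,\ \frac{(s(q_1^T) + 1)(d - 1)}{\eta_0 L T} \Bigr\} . \]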